16 research outputs found

    "Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"

    Get PDF
    This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future.Postprint (published version

    "Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"

    Get PDF
    This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future

    Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters

    Get PDF
    MPI is the de facto communication standard library for parallel applications in distributed memory architectures. Collective operations performance is critical in HPC applications as they can become the bottleneck of their executions. The advent of larger node sizes on multicore clusters has motivated the exploration of hierarchical collective algorithms aware of the process placement in the cluster and the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that do not form part of the current MPI standard. We implement the algorithms on top of OpenMPI using the shared-memory facility provided by MPI-3 at the intra-node level and evaluate them on ARM-based multicore clusters. From our results, we evidence aspects of the algorithms that impact the performance and applicability of the different algorithms. Finally, we propose a model that helps us to analyze the scalability of the algorithms.This work has been supported by the Spanish Ministry of Education (PID2019-107255GB-C22) and the Generalitat de Catalunya (2017-SGR-1414).Peer ReviewedPostprint (author's final draft

    Mapping parallel loops on multicore systems

    Get PDF
    The compute nodes in contemporary HPC systems contain one or more multicore processors. As a result, these nodes constitute a shared-memory multiprocessor, often combining CMP and SMT concurrency technologies. This configuration introduces different levels of sharing in the cache hierarchy, resulting in non-uniform data sharing overheads. In this paper we analyze the data-sharing patterns that exhibit a real multithreaded application when executing on a multicore system, with emphasis in the use of the shared last level cache (LLC) for the concurrent threads. As a consequence of this study, we explore the loop mapping problem in such systems with the aim of optimizing the shared use of the the LLC by all parallel threads. We propose a three-phase loop mapping strategy that deals with workload imbalances, minimizes cache sharing interferences, and maximizes intra-core and inter-core data reuse in the cache hierarchy. Preliminary results show some benefits of our approach. However, this is a work in progress and much more research is being done.Postprint (author’s final draft

    Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

    Get PDF
    Load imbalance in parallel systems can be generated by external factors to the currently running applications like operating system noise or the underlying hardware like a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulties to balance their computations across the parallel tasks. In this article we extend, improve and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are stressed with just useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, in a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of such mechanism. To that end, we perform the Task Packing “in place”, taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using FT and miniFE benchmarks. Results showed that our proposal generates low overhead. In addition the amount of freed CPUs are related to a load imbalance metric which can be used as a prediction for it.Peer ReviewedPostprint (author's final draft

    Noise inspector tool

    Get PDF
    The operating system noise can interfere with normal execution programs. This behavior is becoming especially important when scaling parallel programs and amplified with global synchronizations. This work presents a tool to detect in a non-intrusive way the alien programs that share resources with current running applications in a multicore cluster.The authors acknowledge the support of the BSC (Barcelona Supercomputing Centre). We would like to thank the anonymous reviewers for their comments, which helped us to improve the manuscript. This work has been supported by the Spanish Ministry of Science and Innovation (contract TIN2015-65316), the Spanish Ministry of Economy, Industry and Competitiveness (HAR2014-57776-P) and the Generalitat de Catalunya (2014 SGR-1051).Peer ReviewedPostprint (author's final draft

    OpenCAL++: An object-oriented architecture for transparent parallel execution of cellular automata models

    Get PDF
    Cellular Automata (CA) models, initially studied by John von Neumann, have been developed by numerous researchers and applied in both academic and scientific fields. Thanks to their local and independent rules, simulations of complex systems can be easily implemented based on CA modelling on parallel machines. However, due to the heterogeneity of the components - from the hardware to the software perspective-the various possible scenarios running parallelism in today’s architectures can pose a challenge in such implementations, making it difficult to exploit. This paper presents OpenCAL++, a transparent and efficient object-oriented platform for the parallel execution of cellular automata models. The architecture of OpenCAL++ ensures the modeller a fully transparent parallel execution and a strong ”separation of concerns” between the execution parallelism issues and the model implementation. The code implementing the Cellular Automata model remains the same whether the execution performs in a shared-, distributed-memory or a GPGPU context, irrespective of the optimizations adopted. To this aim, the object-oriented paradigm has been intensely exploited. As well as the OpenCAL++ architecture, we present the description of a simple Cellular Automata model implementation for illustrative purposes.This research was funded by the Italian “ICSC National Center for HPC, Big Data and Quantum Computing” Project, CN00000013 (approved under the Call M42C –Investment 1.4 – Avvisto “Centri Nazionali” – D.D. n. 3138 of 16.12.2021, admitted to financing with MUR Decree n. 1031 of 06.17.2022)Peer ReviewedPostprint (author's final draft

    Tareador: a tool to unveil parallelization strategies at undergraduate level

    Get PDF
    This paper presents a methodology and framework designed to assist students in the process of finding appropriate task decomposition strategies for their sequential program, as well as identifying bottlenecks in the later execution of the parallel program. One of the main components of this framework is Tareador, which provides a simple API to specify potential task decomposition strategies for a sequential program. Once the student proposes how to break the sequential code into tasks, Tareador 1) provides information about the dependences between tasks that should be honored when implementing that task decomposition using a parallel programming model; and 2) estimates the potential parallelism that could be achieved in an ideal parallel architecture with infinite processors; and 3) sim- ulates the parallel execution on an ideal architecture estimating the potential speed–up that could be achieved on a number of processors. The pedagogical style of the methodology is currently applied to teach parallelism in a third-year compulsory subject in the Bachelor Degree in Informatics Engineering at the Barcelona School of Informatics of the Universitat Politècnica de Catalunya (UPC) - BarcelonaTech.Peer ReviewedPostprint (published version

    Examen Final

    No full text
    ARISO
    corecore